Overview

Machine Learning (ML) and Topological Data Analysis (TDA) are different approaches to data analysis, each with its own strengths and weaknesses. This post applies both to New York City collision data and compares what each reveals.

Load the libraries

I used the following libraries for the analysis and visualization. To keep the post concise, I don't show the code for most of the data cleaning and analysis steps, but the code can be found on GitHub. The TDA package is used for its persistent homology capabilities. The TDAmapper package implements the Mapper algorithm. The ggmap package is used to visualize spatial data and models on top of static maps from Google.

library(readxl)
library(TDA)
library(dplyr)
library(ggplot2)
library(ggmap)
library(TDAmapper)
library(igraph)
library(geosphere)
library(lubridate)

New York City Collision Data

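This collision data set consists of 160 observations. The collisions are recorded as taking place between 1899-12-31 06:00:00 and 1899-12-31 18:00:00; the 1899-12-31 date is most likely a placeholder that appears when time-only spreadsheet cells are read as date-times, so only the time of day should be treated as meaningful. New York City encompasses five county-level administrative divisions called boroughs: Manhattan, Brooklyn, Queens, the Bronx, and Staten Island. The data do not identify the borough of each collision.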

Data Wrangling

Overall, the data file is clean with few missing observations, so the main data wrangling tasks are:

  • Time of day: creating a time-of-day variable (morning, afternoon, evening, night)
  • Injured: creating a boolean that indicates whether a person was injured during the collision
  • Hour: extracting the hour from the date/time field

Data Records

  • Date: the date of the collision. Format: Year-Month-Day
  • Time: the date/time when the collision occurred. Format: Year-Month-Day Hour:Minute:Second UTC
  • Latitude
  • Longitude
  • Person Injured: the number of persons injured

The following are derived fields:

  • Time Day: the time of day of the collision (afternoon or evening)
  • Hour: the hour when the collision occurred
  • Injured: a true/false flag that indicates whether a person was injured
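The derived fields above can be sketched as follows. This is a minimal illustration, not the code from the post: the toy records, the column name `PersonInjured`, and the hour cut points for the time-of-day labels are all assumptions. It uses base R only so it runs without extra packages; the same logic maps onto `dplyr::mutate` and `lubridate::hour`.

```r
# Toy stand-in for the collision records; all values are invented for illustration.
collisions <- data.frame(
  Time = as.POSIXct(c("1899-12-31 06:15:00", "1899-12-31 13:40:00",
                      "1899-12-31 17:05:00"), tz = "UTC"),
  Latitude = c(40.75, 40.68, 40.58),
  Longitude = c(-73.98, -73.94, -74.10),
  PersonInjured = c(0, 2, 1)  # hypothetical column name
)

# Hour: integer hour extracted from the date/time field.
collisions$Hour <- as.integer(format(collisions$Time, "%H", tz = "UTC"))

# Injured: TRUE when at least one person was injured.
collisions$Injured <- collisions$PersonInjured > 0

# TimeDay: assumed cut points (not taken from the post):
# night 0-5, morning 6-11, afternoon 12-17, evening 18-23.
collisions$TimeDay <- cut(
  collisions$Hour,
  breaks = c(-1, 5, 11, 17, 23),
  labels = c("night", "morning", "afternoon", "evening")
)

print(collisions[, c("Hour", "Injured", "TimeDay")])
```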

Exploratory Data Analysis

This data set is balanced, with equal numbers of accidents occurring in the afternoon and at night.

The total number of accidents increases with the hour of the day.

The majority of accidents did not involve people being injured.

Topological Data Analysis

Clustering (stats package)

Clustering using k-means from the stats package

In this section, the data were clustered using k-means, with the number of clusters ranging from 1 (no clustering) through 7.
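As a rough sketch of this step, the snippet below runs `kmeans` from the stats package for k = 1 through 7 on synthetic coordinates. The points are randomly generated inside an approximate NYC bounding box, not the real collision data, and treating longitude/latitude as plain Euclidean coordinates is a simplifying assumption; `nstart = 10` is likewise an arbitrary choice.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Synthetic longitude/latitude points in a rough NYC bounding box (invented data).
coords <- data.frame(
  Longitude = runif(160, min = -74.25, max = -73.70),
  Latitude  = runif(160, min = 40.50, max = 40.90)
)

# Fit k-means for k = 1..7; k = 1 puts every point in one cluster ("no clustering").
fits <- lapply(1:7, function(k) kmeans(coords, centers = k, nstart = 10))

# Total within-cluster sum of squares for each k.
wss <- sapply(fits, function(fit) fit$tot.withinss)

# Cluster labels for k = 3, ready to map to a colour aesthetic in ggplot2 or ggmap.
coords$cluster_k3 <- factor(fits[[3]]$cluster)

print(round(wss, 3))
```

Each set of labels can then be drawn on a map, one colour per cluster, which is where gaps between and within clusters become visible.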

No clustering

In this map we identify a hole that includes the East River.

K-Means K = 2

In this map we identify a hole (a parallelogram) in Staten Island.

K-Means K = 3

In this map we identify holes in Staten Island, Queens and the Upper East Side of Manhattan.

K-Means K = 4

In this map we identify holes in Staten Island, Queens, Lower Manhattan and Harlem.

K-Means K = 5

In this map we identify holes in Staten Island, Queens, Lower Manhattan, Harlem and Brooklyn.

K-Means K = 6

In this map we identify holes in Staten Island, Queens, Lower Manhattan, Harlem, Brooklyn and the Bronx.

K-Means K = 7

In this map we identify additional holes in Staten Island, Queens, Lower Manhattan, Harlem, Brooklyn and the Bronx.

K-Means K = 2 (Injured)

In this map we identify two holes that include the East River and Brooklyn.

K-Means K = 2 (Time of day)

In this map we identify two holes, one in Manhattan (afternoon) and one in Queens/Brooklyn (night).

Clustering using TreeKNN

In this map we identify 20 holes in the five boroughs of New York City.

In this map we identify various holes in the five boroughs of New York City.

mapper1D to identify figures

I conducted topological data analysis using mapper1D from the TDAmapper package. Here are the steps that produce the visualization:

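  1. Apply some map (filter) to the data

  2. Cover the range of the filter with overlapping intervals

  3. Run a clustering algorithm (hierarchical clustering) on the data points that fall in each interval

  4. Represent the resulting clusters as nodes, and connect nodes whose clusters share data points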
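To make the four steps concrete, here is a from-scratch sketch of a one-dimensional Mapper in base R. It is not the TDAmapper implementation: the function name `mapper_sketch`, its parameters, and the fixed `cut_height` used to cut the single-linkage tree are all illustrative assumptions. The example data are points on a circle, filtered by their x coordinate.

```r
# From-scratch 1D Mapper sketch in base R (hypothetical helper, not TDAmapper's code).
# Step 1 happens outside the function: the caller supplies the filter values.
mapper_sketch <- function(points, filter_values, num_intervals = 4,
                          percent_overlap = 50, cut_height = 0.5) {
  # Step 2: cover the filter range with equal-length overlapping intervals.
  lo <- min(filter_values)
  hi <- max(filter_values)
  overlap <- percent_overlap / 100
  len <- (hi - lo) / (num_intervals - (num_intervals - 1) * overlap)
  step <- len * (1 - overlap)
  eps <- 1e-9  # tolerance so rounding does not drop boundary points

  nodes <- list()
  for (i in seq_len(num_intervals)) {
    a <- lo + (i - 1) * step
    idx <- which(filter_values >= a - eps & filter_values <= a + len + eps)
    if (length(idx) == 0) next
    # Step 3: single-linkage hierarchical clustering within the interval.
    if (length(idx) == 1) {
      labels <- 1
    } else {
      tree <- hclust(dist(points[idx, , drop = FALSE]), method = "single")
      labels <- cutree(tree, h = cut_height)
    }
    for (lab in unique(labels)) {
      nodes[[length(nodes) + 1]] <- idx[labels == lab]
    }
  }

  # Step 4: one node per cluster; connect nodes that share at least one point.
  n <- length(nodes)
  adjacency <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i < j && length(intersect(nodes[[i]], nodes[[j]])) > 0) {
        adjacency[i, j] <- adjacency[j, i] <- 1
      }
    }
  }
  list(points_in_node = nodes, adjacency = adjacency)
}

# Example: 60 points on the unit circle, filtered by their x coordinate.
theta <- 2 * pi * (0:59) / 60
circle <- cbind(x = cos(theta), y = sin(theta))
m <- mapper_sketch(circle, filter_values = circle[, "x"])
print(m$adjacency)
```

With these settings the graph for the circle should come out as a single loop, which is the kind of hole Mapper is meant to expose. The adjacency matrix can be passed to igraph for plotting.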

Additional Plotting

Silhouette and landscape

The data shows a simple silhouette and landscape.

## # Generated complex of size: 682800
## # Persistence timer: Elapsed time [ 7.570188 ] seconds
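For readers unfamiliar with these two summaries, the sketch below computes a persistence landscape and a power-weighted silhouette by hand in base R. The three (birth, death) pairs are a made-up diagram, not results from the collision data, and the helper names `landscape_k` and `silhouette_p` are hypothetical. In the actual analysis, the TDA package's `landscape` and `silhouette` functions compute these from a persistence diagram.

```r
# Hypothetical persistence diagram: each row is one feature's (birth, death).
diagram <- rbind(c(0.0, 1.0), c(0.2, 0.6), c(0.5, 0.9))
tseq <- seq(0, 1, by = 0.1)  # grid on which both functions are evaluated

# Tent function of each feature on the grid: max(0, min(t - birth, death - t)).
# Rows correspond to grid points, columns to features.
tents <- sapply(seq_len(nrow(diagram)), function(i) {
  pmax(0, pmin(tseq - diagram[i, 1], diagram[i, 2] - tseq))
})

# k-th landscape: the k-th largest tent value at each grid point.
landscape_k <- function(k) {
  apply(tents, 1, function(v) sort(v, decreasing = TRUE)[k])
}

# Power-weighted silhouette: tents averaged with weights (death - birth)^p.
silhouette_p <- function(p) {
  w <- (diagram[, 2] - diagram[, 1])^p
  as.vector(tents %*% w) / sum(w)
}

lambda1 <- landscape_k(1)
phi1 <- silhouette_p(1)
print(data.frame(t = tseq, landscape = lambda1, silhouette = phi1))
```

The first landscape tracks the most persistent feature at each scale, while the silhouette blends all features into one curve, giving long-lived features more weight as p grows.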

Conclusion

The TDAmapper and TDA packages were able to provide more granular information on the NYC collision data set than the ML k-means clustering method. The holes identified are areas where no collisions in this data set took place, which suggests they are among the safer areas of NYC.

Next steps

Possible future analyses include joining the current data set with weather and population data.